In the dynamic landscape of business, the ability to predict future trends and outcomes is invaluable. Forecasting stands as a cornerstone in strategic planning, offering a glimpse into potential future scenarios based on current data and historical trends. Its significance spans various sectors, from finance and marketing to supply chain management and beyond. Among the many applications of forecasting, the domain of box office predictions provides a compelling illustration of its utility and impact.
In this project, we extract box office information from BoxOfficeMojo.com (https://www.boxofficemojo.com), aiming to provide weekly box office predictions using time series analysis. Our approach starts with thorough exploratory data analysis, coupled with comprehensive data cleaning and preparation. Significant external events, such as the COVID-19 pandemic, are critical in influencing box office trends, and we factor in these and other external elements. The project tackles complex, real-world data, presenting a fascinating and challenging dataset. We employ a SARIMAX model, effectively capturing the seasonal patterns, trends, and cyclical nature inherent in the data. Looking ahead, further exploration into the influence of social media trends on box office performance could enhance the accuracy of our forecasts.
# import libraries
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.colors as mcolors
from statsmodels.tsa.statespace.sarimax import SARIMAX
import seaborn as sns
The data is loaded from the accompanying database. Initial examination reveals that cleaning is necessary; in particular, we apply the following techniques:
# Load the data from the SQLite database
db_path = 'box_office_data.db'
conn = None
try:
    conn = sqlite3.connect(db_path)
    query = "SELECT * FROM box_office_data"
    df = pd.read_sql_query(query, conn)
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    if conn is not None:
        conn.close()
df.head(15)
| | Date | Weekday | DayOfYear | DailyGross | DailyChange | WeeklyChange | NoReleases | No1Movie | GrossOfNo1Movie | SpecialOccasion |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jan 9 2024 | Tuesday | 9 | 6685011 | +23% | -66.6% | 22 | Anyone But You | 1268320 | |
| 1 | Jan 8 2024 | Monday | 8 | 5435026 | -71.3% | -81.4% | 24 | Anyone But You | 1003785 | |
| 2 | Jan 7 2024 | Sunday | 7 | 18939826 | -40.4% | -17.9% | 34 | Wonka | 3638210 | |
| 3 | Jan 6 2024 | Saturday | 6 | 31767466 | +28.6% | -20.7% | 34 | Wonka | 6130408 | |
| 4 | Jan 5 2024 | Friday | 5 | 24699285 | +97.1% | -33.9% | 34 | Night Swim | 5246275 | |
| 5 | Jan 4 2024 | Thursday | 4 | 12533498 | -12.1% | -62.3% | 39 | Wonka | 2703155 | |
| 6 | Jan 3 2024 | Wednesday | 3 | 14266846 | -28.7% | -57.9% | 39 | Wonka | 3152613 | |
| 7 | Jan 2 2024 | Tuesday | 2 | 20006635 | -31.5% | -52.1% | 39 | Wonka | 4601670 | |
| 8 | Jan 1 2024 | Monday | 1 | 29208719 | -19.3% | -48.5% | 41 | Wonka | 6636962 | New Year |
| 9 | Dec 31 2023 | Sunday | 365 | 23078184 | -17.5% | -52.6% | 39 | Wonka | 5208897 | New Year |
| 10 | Dec 30 2023 | Saturday | 364 | 40050370 | +7.2% | +38% | 39 | Wonka | 8637841 | |
| 11 | Dec 29 2023 | Friday | 363 | 37348409 | +12.3% | -6.4% | 39 | Wonka | 8630268 | |
| 12 | Dec 28 2023 | Thursday | 362 | 33261609 | -1.9% | +168.2% | 40 | Wonka | 7988504 | |
| 13 | Dec 27 2023 | Wednesday | 361 | 33892628 | -18.9% | +251.5% | 40 | Wonka | 8135639 | |
| 14 | Dec 26 2023 | Tuesday | 360 | 41788862 | -28.6% | +259.9% | 40 | Wonka | 8970413 | |
# The date of 9/11 was malformed in the source data ('Sep 119 2001'); correct it
date_wrong = df[df.Date == 'Sep 119 2001'].index.values[0]
df.loc[date_wrong, 'Date'] = 'Sep 11 2001'
# Transform the dates into datetime format
df['Date'] = pd.to_datetime(df['Date'], format='%b %d %Y')
# Sort by date
df.sort_values(by='Date',ascending=False,inplace=True)
# Drop wrong value from dataset
df.drop([13064],inplace=True)
# Reset index
df.reset_index(inplace=True)
df.drop(columns=['index'],inplace=True)
# Check for missing values (does not check for missing rows)
df.isna().sum()
Date               0
Weekday            0
DayOfYear          0
DailyGross         0
DailyChange        0
WeeklyChange       0
NoReleases         0
No1Movie           0
GrossOfNo1Movie    0
SpecialOccasion    0
dtype: int64
# Drop daily and weekly change
df.drop(columns=['DailyChange','WeeklyChange'],inplace=True)
Since much of the pre-1997 data is missing and its patterns may differ substantially from recent years, we ignore all data from before 1997. Additionally, some data from March 2020 is missing; we interpolate these values. Note, however, that incorporating data from 2020 and 2021 degrades forecast quality, because strong external factors dominated box office behavior in those years.
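The interpolation step can be illustrated on a small synthetic series (the values and dates below are made up, not taken from the dataset): once the missing days exist as NaN rows, pandas fills the internal gap linearly.

```python
import numpy as np
import pandas as pd

# Toy daily series with a three-day gap, mimicking the missing late-March 2020 rows
s = pd.Series(
    [100.0, 110.0, np.nan, np.nan, np.nan, 150.0],
    index=pd.date_range("2020-03-25", periods=6, freq="D"),
)

# Linear interpolation fills the gap in equal steps: 110 -> 120 -> 130 -> 140 -> 150
filled = s.interpolate(method="linear")
print(filled.loc["2020-03-27"])  # 120.0
```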
# Create new column for the year
df['Year'] = df['Date'].dt.year
# Count the number of entries per year
data_per_year = df.groupby('Year').size()
# Plotting
plt.figure(figsize=(12, 6))
# Plot all data in blue
plt.bar(data_per_year.index, data_per_year.values, color='skyblue', label='Complete data')
# Overlay in palevioletred where entries do not cover all days
plt.bar(data_per_year[data_per_year < 365].index, data_per_year[data_per_year < 365].values, color='palevioletred', label='< 365 Entries')
# Overlay in goldenrod for years with fewer than 300 entries
plt.bar(data_per_year[data_per_year < 300].index, data_per_year[data_per_year < 300].values, color='goldenrod', label='< 300 Entries')
plt.title('Number of Data Entries per Year')
plt.xlabel('Year')
plt.ylabel('Number of Entries')
plt.legend()
plt.show()
df_2020 = df[df['Year']==2020].copy()
df_2020['Date_Shift1'] = df_2020['DayOfYear'].shift(1)
df_2020 = df_2020.dropna()
df_2020['TimeDelta'] = df_2020['Date_Shift1'] - df_2020['DayOfYear']
df_2020.head()
| | Date | Weekday | DayOfYear | DailyGross | NoReleases | No1Movie | GrossOfNo1Movie | SpecialOccasion | Year | Date_Shift1 | TimeDelta |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1117 | 2020-12-30 | Wednesday | 365 | 2876460 | 16 | Wonder Woman 1984 | 1451131 | COVID | 2020 | 366.0 | 1.0 |
| 1118 | 2020-12-29 | Tuesday | 364 | 3063773 | 17 | Wonder Woman 1984 | 1655240 | COVID | 2020 | 365.0 | 1.0 |
| 1119 | 2020-12-28 | Monday | 363 | 3085353 | 17 | Wonder Woman 1984 | 1822877 | COVID | 2020 | 364.0 | 1.0 |
| 1120 | 2020-12-27 | Sunday | 362 | 5302360 | 27 | Wonder Woman 1984 | 3408342 | COVID | 2020 | 363.0 | 1.0 |
| 1121 | 2020-12-26 | Saturday | 361 | 8403557 | 27 | Wonder Woman 1984 | 5811188 | COVID | 2020 | 362.0 | 1.0 |
# Plotting
plt.figure(figsize=(12, 6))
# Flag all entries in red, then overlay consecutive-day entries in blue
plt.bar(df_2020.DayOfYear, df_2020.TimeDelta, color='red', label='Attention')
plt.bar(df_2020[df_2020['TimeDelta']<2].DayOfYear, df_2020[df_2020['TimeDelta']<2].TimeDelta, color='blue', label='Good values')
plt.title('Time Until Next Data Entry')
plt.xlabel('Day of Year')
plt.ylabel('Number of Days')
plt.legend()
plt.show()
df_2020[df_2020['TimeDelta'] >= 2]
| | Date | Weekday | DayOfYear | DailyGross | NoReleases | No1Movie | GrossOfNo1Movie | SpecialOccasion | Year | Date_Shift1 | TimeDelta |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1389 | 2020-03-27 | Friday | 87 | 192 | 1 | Be Natural: The Untold Story of Alice Guy-Blaché | 192 | COVID | 2020 | 94.0 | 7.0 |
df_covid = df[df['SpecialOccasion']=='COVID']
# Set 'Date' as the index
df.set_index('Date', inplace=True)
# Filter data from 1997 onwards (a boolean mask works regardless of sort order)
df_post_1997 = df[df.index >= '1997-01-01']
# Construct placeholder rows for the missing dates, using np.nan so that interpolation works
df_missing = pd.DataFrame({'Date': pd.date_range('2020-03-28', '2020-04-02', freq='D'),
                           'Weekday': ['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday'],
                           'DayOfYear': list(range(88, 94)),
                           'DailyGross': [np.nan] * 6,
                           'NoReleases': [np.nan] * 6,
                           'No1Movie': [np.nan] * 6,
                           'SpecialOccasion': ['COVID'] * 6,
                           'Year': ['2020'] * 6
                           })
df_missing.set_index('Date', inplace=True)
# Combine the original DataFrame with the missing dates DataFrame
df_combined = pd.concat([df, df_missing]).sort_index()
# Convert integer values to float for interpolation
df_combined['DailyGross'] = df_combined['DailyGross'].astype(float)
df_combined['NoReleases'] = df_combined['NoReleases'].astype(float)
# Linear interpolation of the missing data
df_2020 = df_combined.loc['2020'].interpolate()
df_combined.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 13456 entries, 1977-05-25 to 2024-01-21
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Weekday          13456 non-null  object
 1   DayOfYear        13456 non-null  int64
 2   DailyGross       13450 non-null  float64
 3   NoReleases       13450 non-null  float64
 4   No1Movie         13456 non-null  object
 5   GrossOfNo1Movie  13450 non-null  float64
 6   SpecialOccasion  13456 non-null  object
 7   Year             13456 non-null  object
dtypes: float64(3), int64(1), object(4)
memory usage: 946.1+ KB
# Reintegrate interpolated data back into the main DataFrame
df_combined.update(df_2020)
After preprocessing our dataset, we visualize the time series to gain insight into its underlying trends, seasonality, and cycles. The data reveals distinct weekly patterns, with peaks on weekends, as well as changes that appear seasonal. Moreover, whenever a highly anticipated movie is released, the box office jumps suddenly. In December 2021, "Spider-Man: No Way Home" caused a significant surge, reminiscent of the effect of "Avengers: Endgame" in May 2019 and "Avengers: Infinity War" in April 2018. Other standout moments include the December releases of the new "Star Wars" movies in 2017 and 2015. In 2023, "Barbie" and "Oppenheimer" produced a characteristic peak in July; the usual December jump, however, is absent due to the release delays caused by the 2023 Writers Guild of America strike.

Additionally, several noteworthy events stand out. The impact of the COVID-19 pandemic is unmistakable, as it led to substantial deviations in the data. Consequently, we are inclined to exclude the years 2020 and 2021 from our primary analysis, as they diverge significantly from the typical data patterns and may adversely impact predictive models; later in our analysis, we will compare predictions with and without this data. We also discard data points preceding 2002, as they do not exhibit the same patterns as modern data (note, for instance, the dip after 9/11, when people avoided crowded places for a while).

For more accurate predictions, we consider that these significant jumps in box office revenue might be correlated with how long a movie has reigned as the number one film. Consequently, we incorporate this as an additional feature in our prediction models, recognizing its potential impact on forecasting accuracy.
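Such a reign-length feature could be derived directly from the No1Movie column; the following is a sketch on toy data (the run-counting logic and the `DaysAsNo1` column name are our own, not yet part of the pipeline):

```python
import pandas as pd

# Toy No1Movie column in chronological order (titles borrowed from the table above)
df_toy = pd.DataFrame({"No1Movie": ["Wonka", "Wonka", "Wonka", "Night Swim", "Wonka", "Wonka"]})

# Each change of the number-one movie starts a new run; count days within each run
run_id = df_toy["No1Movie"].ne(df_toy["No1Movie"].shift()).cumsum()
df_toy["DaysAsNo1"] = df_toy.groupby(run_id).cumcount() + 1
print(df_toy["DaysAsNo1"].tolist())  # [1, 2, 3, 1, 1, 2]
```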
fig, axs = plt.subplots(27, 1, figsize=(18, 120))
for year in range(1997, 2024):
    axs[year-1997].plot(df_combined.loc[str(year)].index, df_combined.loc[str(year)]['DailyGross'], label='Box Office Data', color='darkmagenta', linewidth=2, marker='o')
    axs[year-1997].title.set_text('Daily Box Office Total ' + str(year))
    axs[year-1997].set_ylim([0, 120000000])
plt.show()
# Applying one-hot-encoding
df_no1_movie_one_hot = pd.get_dummies(df_combined, columns=['No1Movie'])
# Display the resulting DataFrame
df_no1_movie_one_hot.head()
| Weekday | DayOfYear | DailyGross | NoReleases | GrossOfNo1Movie | SpecialOccasion | Year | No1Movie_10,000 BC | No1Movie_12 Strong | No1Movie_13 Going on 30 | ... | No1Movie_X2 | No1Movie_Yes Man | No1Movie_You Got Served | No1Movie_You've Got Mail | No1Movie_Zack and Miri Make a Porno | No1Movie_Zero Dark Thirty | No1Movie_Zombieland | No1Movie_Zootopia | No1Movie_nan | No1Movie_xXx | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | |||||||||||||||||||||
| 1977-05-25 | Wednesday | 145.0 | 254809.0 | 1.0 | 254809.0 | 1977 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1977-05-26 | Thursday | 146.0 | 238965.0 | 1.0 | 238965.0 | 1977 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1977-12-26 | Monday | 360.0 | 3026559.0 | 1.0 | 3026559.0 | 1977 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1980-05-21 | Wednesday | 142.0 | 1333305.0 | 1.0 | 1333305.0 | 1980 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
| 1980-05-22 | Thursday | 143.0 | 995026.0 | 1.0 | 995026.0 | 1980 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 1219 columns
In the following plot, we're taking a closer look at how long movies have ruled the Box Office's top spot over the last three decades. What's intriguing is how the movie industry has evolved in nearly 30 years: There's been a surge in movie releases, especially those big-budget blockbusters that draw huge crowds. As a result, the time a movie spends as the number one Box Office champ has shrunk to less than a third of what it used to be (from 150 days for "Independence Day" to less than 50 days for "Avatar: The Way of Water").
This observation sheds light on the dynamic nature of the movie industry, with more releases and evolving audience tastes. It's a world where movies compete fiercely for that coveted top position in a landscape that's become increasingly competitive.
# Define time intervals
intervals = [('1996-2004', '1996-01-01', '2004-12-31'), ('2005-2014', '2005-01-01', '2014-12-31'), ('2015-2024', '2015-01-01', '2024-01-01')]
# Colors for the plots
colors = ['palevioletred', 'plum', 'mediumslateblue']
# Create subplots with 3 rows and 2 columns
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(18, 30))
for i, (title, start_date, end_date) in enumerate(intervals):
    # Filter data for the interval
    interval_data = df_no1_movie_one_hot.drop(columns=['DayOfYear','NoReleases','GrossOfNo1Movie','SpecialOccasion','DailyGross','Weekday','Year']).loc[start_date:end_date]
    # Sum up the days each movie was number one
    days_as_no1 = interval_data.sum().sort_values(ascending=False)
    # Distribution plot for the interval
    axes[i, 0].bar(days_as_no1[days_as_no1.values > 1].index, days_as_no1[days_as_no1.values > 1].values, color=colors[i])
    axes[i, 0].set_title(f'Distribution of Number One Days for Each Movie ({title})')
    axes[i, 0].set_xlabel('Movies')
    axes[i, 0].set_ylabel('Number of Days as Number One')
    axes[i, 0].set_xticks([])
    # Top 15 movies plot for the interval
    top_15_movies = days_as_no1.head(15)
    labels = [movie.replace('No1Movie_','') for movie in top_15_movies.index]
    axes[i, 1].bar(top_15_movies.index, top_15_movies.values, color=colors[i])
    axes[i, 1].set_title(f'Top 15 Movies with the Longest Time as Number One ({title})')
    axes[i, 1].set_xlabel('Movies')
    axes[i, 1].set_ylabel('Number of Days as Number One')
    axes[i, 1].set_xticks(range(15))
    axes[i, 1].set_xticklabels(labels, rotation=90)
fig.tight_layout()
plt.show()
When we compare the box office on regular days with holidays and other special occasions (excluding COVID), it becomes clear that holidays consistently boost the box office in every year.
# Data split into regular days and special occasions
df_regular = df_combined[df_combined['SpecialOccasion']=='']
df_special = df_combined[df_combined['SpecialOccasion']!='']
df_special_nocovid = df_special[df_special['SpecialOccasion']!= 'COVID']
# Calculate mean box office for each year for regular and special days
df_gross_regular = df_regular.groupby('Year')['DailyGross'].mean()
df_gross_special_nocovid = df_special_nocovid.groupby('Year')['DailyGross'].mean()
# Plot the data for visualization
plt.figure(figsize=(15, 7))
plt.plot(df_gross_regular.index,df_gross_regular.values,color = 'skyblue',linewidth=4)
plt.plot(df_gross_special_nocovid.index,df_gross_special_nocovid.values,color = 'palevioletred',linewidth=4)
plt.legend(['Regular Days','Special Occasion'],fontsize=20)
plt.title('Box Office Average',fontsize=25)
plt.xlabel('Year',fontsize=15)
plt.ylabel('Average Daily Gross',fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()
Finally, for every year, we plot the weekly development of the daily box office. Clearly, there is a weekly pattern. Fridays usually show a jump in Box Office as the weekend starts and new movies are released. On weeks without highly expected releases, Saturdays tend to have a slightly higher box office. Additionally, holidays can cause jumps in the daily box office during the week.
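The weekday effect described above can be quantified with a simple groupby; here is a minimal sketch on made-up numbers (in pandas, `dayofweek` runs from 0 = Monday to 6 = Sunday):

```python
import pandas as pd

# Two weeks of made-up daily grosses, starting on a Monday
idx = pd.date_range("2023-06-05", periods=14, freq="D")
gross = pd.Series([10, 9, 9, 10, 18, 22, 16,
                   11, 10, 9, 11, 19, 23, 17], index=idx, dtype=float)

# Average gross per weekday (0 = Monday, ..., 6 = Sunday)
by_weekday = gross.groupby(gross.index.dayofweek).mean()
print(by_weekday.idxmax())  # 5 -> Saturday peaks in this toy "quiet week" data
```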
# Add data on the week of the year
df_combined['WeekOfYear'] = df_combined.index.isocalendar().week
# Plotting box office data by weekday for each week of the year
unique_weeks = df_combined['WeekOfYear'].unique()
# Set up the colors for the plots
colors = plt.cm.RdPu(np.linspace(0.25, 1, len(unique_weeks)))
fig, axs = plt.subplots(13, 2, figsize=(18, 120))
for year in range(1998, 2024):
    for week, color in zip(unique_weeks, colors):
        week_data = df_combined.loc[str(year)][df_combined['WeekOfYear'].loc[str(year)] == week]
        axs[(year-1998)//2, year%2].plot(week_data['Weekday'], week_data['DailyGross'], marker='o', color=color, label=f'Week {week}')
    axs[(year-1998)//2, year%2].set_title('Box Office Data by Weekday for Each Week of ' + str(year))
    axs[(year-1998)//2, year%2].set_xlabel('Weekday')
    axs[(year-1998)//2, year%2].set_ylabel('Box Office')
    axs[(year-1998)//2, year%2].set_xticks(ticks=np.arange(7))
    axs[(year-1998)//2, year%2].set_ylim([0, 200000000])
#plt.legend()
plt.show()
Finally, we create two forecast datasets: one including the COVID data and one excluding it. The COVID-19 pandemic had a large negative impact on the box office, so including this data pulls the forecast down. However, the 2023 Writers Guild of America strike delayed many releases from larger studios, which also depresses the current daily box office; as a result, including the COVID data might actually reflect the current market better.
# Create dataframes for making predictions
df_covid = df_combined.drop(columns=['No1Movie','GrossOfNo1Movie'])
df_covid = df_covid.reset_index()
# Drop data before 2002
df_covid = df_covid.drop(index=range(5401))
# Create dataframe without COVID data (2020-2021)
df_no_covid = df_covid.drop(index=range(11975,12706))
df_no_covid = df_no_covid.reset_index().drop(columns=['index'])
To establish a baseline for our model building, we begin with a simple linear regression and compare its performance when trained on data that includes the COVID period against data that excludes it. The COVID data does not follow the same patterns as the rest of the series, so we expect it to affect the quality of the fit.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
import numpy as np
# One-hot-encode the weekday ('Weekday' is the only categorical feature we keep)
df_encoded = pd.get_dummies(df_covid, columns=['Weekday'])
# Splitting the data
train_df = df_encoded.iloc[:-7] # All data except the last week
test_df = df_encoded.iloc[-7:] # This week
# Selecting features and target
X_train = train_df.drop(['DailyGross', 'Date','SpecialOccasion','WeekOfYear'], axis=1)
y_train = train_df['DailyGross']
X_test = test_df.drop(['DailyGross', 'Date','SpecialOccasion','WeekOfYear'], axis=1)
y_test = test_df['DailyGross']
# Model Training
model = LinearRegression()
model.fit(X_train, y_train)
# Prediction
predictions = model.predict(X_test)
# Evaluation
mape = mean_absolute_percentage_error(y_test, predictions)
mse = mean_squared_error(y_test,predictions)
rmse = np.sqrt(mse)
print('The MAPE of the linear regression model (trained on full data) on the test data is: ',round(mape,2))
print('The RMSE of the linear regression model (trained on full data) on the test data is: ',round(rmse,2))
The MAPE of the linear regression model (trained on full data) on the test data is:  0.64
The RMSE of the linear regression model (trained on full data) on the test data is:  9918117.74
# Plotting actual vs predicted
plt.figure(figsize=(15, 7))
plt.plot(df_encoded.iloc[-150:].index, df_encoded.iloc[-150:].DailyGross, marker='o',markersize=10,label='Actual Data',linewidth=3,color='cornflowerblue')
plt.plot(X_test.index, predictions, label='Predictions', color='fuchsia',linewidth=3,marker='^',markersize=10)
plt.title('Daily Box Office Gross: Actual vs Predictions (COVID data included)')
plt.xlabel('Date')
plt.ylabel('Daily Gross')
plt.legend()
plt.show()
# Repeat the experiment on the dataset without the COVID years
df_encoded = pd.get_dummies(df_no_covid.reset_index().drop(columns='index'), columns=['Weekday'])
# Splitting the data
train_df = df_encoded.iloc[:-7] # All data except the last week
test_df = df_encoded.iloc[-7:] # This week
# Selecting features and target
X_train = train_df.drop(['DailyGross', 'Date','SpecialOccasion','WeekOfYear'], axis=1)
y_train = train_df['DailyGross']
X_test = test_df.drop(['DailyGross', 'Date','SpecialOccasion','WeekOfYear'], axis=1)
y_test = test_df['DailyGross']
# Model Training
model = LinearRegression()
model.fit(X_train, y_train)
# Prediction
predictions = model.predict(X_test)
# Evaluation
mape = mean_absolute_percentage_error(y_test, predictions)
mse = mean_squared_error(y_test,predictions)
rmse = np.sqrt(mse)
print('The MAPE of the linear regression model (trained on restricted data) on the test data is: ',round(mape,2))
print('The RMSE of the linear regression model (trained on restricted data) on the test data is: ',round(rmse,2))
The MAPE of the linear regression model (trained on restricted data) on the test data is:  1.3
The RMSE of the linear regression model (trained on restricted data) on the test data is:  17359543.89
Observe that including the COVID data does a better job at predicting the test data, so here we choose to work with the COVID data, based on the simple linear regression model. The next step is to find a better model. In particular, we use SARIMAX -- a Seasonal model that combines Auto Regression, a Moving Average, and eXogenous features.
The next step is to gain a deeper understanding into trends and seasonal behavior.
Including the COVID data means that a linear trend does not fit well: the sharp drop in box office revenue during COVID cannot be explained by a linear trend. Even when we exclude the COVID data, the moving average still declines below the linear regression trend.
df = pd.get_dummies(df_covid, columns=['Weekday']) # Include COVID data
# 1. Calculate Moving Average
moving_average = df['DailyGross'].rolling(window=365,center=True).mean()
# 2. Linear Trend Forecast Plot
df['DateNum'] = df['Date'].apply(lambda x: x.toordinal()) # Convert dates to ordinal numbers for linear regression
model = LinearRegression()
model.fit(df[['DateNum']], df['DailyGross'])
# Predictions for trend line
df['Trend'] = model.predict(df[['DateNum']])
# Plotting the data and trend
plt.figure(figsize=(15, 7))
plt.plot(df.index, df['DailyGross'], label='Actual Data', color='cornflowerblue')
plt.plot(df.index, df['Trend'], label='Trend', color='fuchsia', linewidth=3)
plt.plot(df.index, moving_average, color='white', linewidth=3, label='1-Year Moving Average')
plt.title('Box Office Data with Linear Trend Line')
plt.xlabel('Date')
plt.ylabel('Daily Gross')
plt.legend()
plt.show()
df_encoded = df_encoded.reset_index().drop(columns='index')
Examining the periodogram indicates the presence of three forms of cyclical behavior: while the weekly cycle dominates, we also observe half-weekly and half-yearly cycles.
import pandas as pd
import numpy as np
from scipy.signal import periodogram
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
# Compute the periodogram of the daily gross (sampling frequency: one sample per day)
fs = 1
frequencies, spectrum = periodogram(df_covid.DailyGross, fs)
# Convert frequencies to periods (days per cycle);
# drop the zero frequency first to avoid division by zero
frequencies, spectrum = frequencies[1:], spectrum[1:]
periods = 1 / frequencies
# Plot the periodogram
plt.figure(figsize=(14, 6))
plt.plot(periods, spectrum, color='skyblue',linewidth=5)
plt.xscale('log')
#plt.yscale('log')
plt.title('Periodogram')
plt.xlabel('Period (days per cycle)')
plt.ylabel('Variance')
# Set x-axis to show periods as daily, weekly, monthly
formatter = FuncFormatter(lambda x, _: f'{int(x)}')
plt.gca().xaxis.set_major_formatter(formatter)
# Highlight daily, weekly, and monthly frequencies
half_weekly_freq = 2/7
weekly_freq = 1/7
half_yearly_freq = 2/365
count = 0
color = ['darkviolet','fuchsia','crimson']
for freq, label in zip([half_weekly_freq, weekly_freq, half_yearly_freq], ['Halfweekly', 'Weekly','Half-Yearly']):
plt.axvline(1/freq, color=color[count], linestyle=':', alpha=0.7, label=f'{label} (1/{1/freq:.0f})',linewidth=5)
count+=1
plt.legend()
plt.show()
Even though we model the domestic box office, including both US and Chinese holidays increases the performance of the model. Other holidays do not improve performance on the validation data.
from datetime import date
import holidays
# Create list of holidays for the two largest contributors to box office
holidays_china = holidays.China(years = range(2002,2025)).items()
holidays_us = holidays.UnitedStates(years = range(2002,2025)).items()
# Create column in pd dateformat
df['DateTime'] =[pd.to_datetime(day).date() for day in df['Date']]
# Initialize columns for holidays
df['Holiday_US'] = 0
df['Holiday_China'] = 0
# One-hot encode holidays
for holiday in holidays_us:
df.loc[df['DateTime'] == holiday[0], 'Holiday_US'] = 1
for holiday in holidays_china:
df.loc[df['DateTime'] == holiday[0], 'Holiday_China'] = 1
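As a design note, the per-holiday loops above can be replaced with a single vectorized `isin` lookup, which avoids one DataFrame scan per holiday. A minimal sketch, using a small hand-built date set in place of the `holidays` package output:

```python
import pandas as pd

# Toy frame of dates; in the project this would be df['DateTime']
dates = pd.DataFrame({'DateTime': pd.to_datetime(
    ['2023-07-04', '2023-07-05', '2023-12-25']).date})

# Hypothetical holiday set; the project builds this from holidays.UnitedStates(...)
us_holidays = {pd.Timestamp('2023-07-04').date(),
               pd.Timestamp('2023-12-25').date()}

# One vectorized membership test instead of a loop over every holiday
dates['Holiday_US'] = dates['DateTime'].isin(us_holidays).astype(int)
print(dates['Holiday_US'].tolist())  # [1, 0, 1]
```

The result is identical to the loop-based encoding; the difference only matters for speed as the holiday list grows.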
We include half-weekly as well as half-yearly features through sine and cosine transformations.
# Half-weekly cycle: Using sine and cosine transformations with a period of 3.5 days (half a week)
df['sin_half_week'] = np.sin(2 * np.pi * df.Date.dt.dayofweek / 3.5)
df['cos_half_week'] = np.cos(2 * np.pi * df.Date.dt.dayofweek / 3.5)
# Half-yearly cycle: Using sine and cosine transformations with a period of 182.5 days (half a year)
day_of_year = df.DayOfYear
df['sin_half_year'] = np.sin(2 * np.pi * day_of_year / 365.25 * 2)
df['cos_half_year'] = np.cos(2 * np.pi * day_of_year / 365.25 * 2)
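As a quick sanity check (a standalone sketch, not project code), the half-weekly features should repeat every 3.5 days, so on the integer day-of-week grid they coincide again after exactly 7 days (two half-week cycles):

```python
import numpy as np

days = np.arange(14)  # two weeks of integer day-of-week indices

sin_half_week = np.sin(2 * np.pi * days / 3.5)  # period of 3.5 days
cos_half_week = np.cos(2 * np.pi * days / 3.5)

# A 7-day shift covers exactly two half-week cycles, so the features line up
assert np.allclose(sin_half_week[:7], sin_half_week[7:])
assert np.allclose(cos_half_week[:7], cos_half_week[7:])
print("half-weekly features repeat on the weekly grid")
```

This confirms the encoding is compatible with the weekly seasonality already handled by the SARIMAX seasonal order.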
Finally, we build the SARIMAX model and tune its hyperparameters on the validation data.
exog_features = ['Weekday_Monday', 'Weekday_Tuesday', 'Weekday_Wednesday', 'Weekday_Thursday',
'Weekday_Friday', 'Weekday_Saturday', 'Weekday_Sunday', 'sin_half_week',
'cos_half_week', 'sin_half_year',
'cos_half_year','Holiday_US','Holiday_China']
# Splitting the data into train and test sets
# The test set is the last week of the dataframe
train = df.iloc[:-7]
test = df.iloc[-7:]
# Target variable
y_train = train['DailyGross']
y_test = test['DailyGross']
exog_test = test[exog_features]
model = SARIMAX(y_train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7), # Weekly seasonality
exog=train[exog_features])
results = model.fit()
predictions = results.get_forecast(steps=7, exog=exog_test)
forecast_mean = predictions.predicted_mean
predictions_ci = predictions.conf_int()
# Plotting the results
plt.figure(figsize=(20, 12))
plt.plot(df.iloc[-150:].DailyGross,marker='o',markersize=10,label='Training Data',linewidth=3,color='cornflowerblue')
#plt.plot(test['DailyGross'], label='Actual Gross',linewidth=5)
plt.plot(forecast_mean,'--',label='Predicted Gross', color='fuchsia',linewidth=3,marker='^',markersize=10)
plt.fill_between(predictions_ci.index, predictions_ci.iloc[:, 0], predictions_ci.iloc[:, 1], color='k', alpha=.2)
plt.title('SARIMAX Model Predictions vs Actual')
plt.legend()
plt.show()
L-BFGS-B optimizer output (abridged): 19 parameters, converged after 22 iterations (24 function evaluations) with final function value F = 17.3412; CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error

# Calculating MAPE and RMSE
mape = mean_absolute_percentage_error(y_test, forecast_mean)
mse = mean_squared_error(y_test,forecast_mean)
rmse = np.sqrt(mse)
print('The MAPE of the hypertuned SARIMAX model on the test data is: ',round(mape,2))
print('The RMSE of the hypertuned SARIMAX model on the test data is: ',round(rmse,2))
The MAPE of the hypertuned SARIMAX model on the test data is: 0.41
The RMSE of the hypertuned SARIMAX model on the test data is: 5657773.21
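Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so a MAPE of 0.41 corresponds to an average error of roughly 41% of the actual daily gross (versus 130% for the linear model). A small worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100.0, 200.0])
y_pred = np.array([150.0, 180.0])

# MAPE = mean(|y_true - y_pred| / |y_true|) = mean(0.5, 0.1) = 0.3
mape = mean_absolute_percentage_error(y_true, y_pred)
print(mape)  # 0.3, i.e. an average error of 30%
```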
The SARIMAX model does a reasonable job of predicting the patterns of box office behavior. It captures the weekly ups and downs more clearly than the linear model. However, there is still a lot of room for improvement. Besides implementing other models (like LSTM), it might be of interest to consider additional features. For example, incorporating social media trends into the dataset could yield better predictions of significant jumps (like "Barbenheimer" in July 2023).